Lab²

On the Incubator for Collaborative and Transparent Economic Sciences | Macartan Humphreys

1 Outline

  • A stocktaking
  • A wishlist

2 Punchlines

Stocktaking

  • Extraordinary gains over 10 years

A workflow-oriented wishlist

  1. Reproduction: let’s get it out of the box
  2. Registration: let’s agree on a completeness criterion
  3. Reconciliation: let’s get this automated
  4. Reanalysis: let’s have re-analyses justified by design, not data
  5. Replication (field replication): let’s confront scope-condition confusion
  6. Cumulation: let’s make coordinated trials rolling
  7. Reporting: let’s communicate better at what_we_know.com

2.1 Stocktaking

We started in a bad place

2.2 Stocktaking

Lots of gains

  • Much more open access (e.g. EconStor)
  • Much greater access to reproduction data. Imperfect, but still major gains.
  • Serious reproduction initiatives (I4R discussion paper series has 163 papers)
  • Much greater reproduction success rates (I4R success rate > 50%)
  • Pre-publication reproduction increasingly common
  • Registration now mainstream
  • Priors information commonly gathered

2.3 Quick check

AER today:

2.4 Quick check

Open access | Open data with DOI | Reproduced | Nice badge!

2.5 Implications

  • It’s getting better all the time
  • Let’s keep dreaming: Lab² should think big and systematic

3 Reproduction out of the box

3.1 Vultures article

I tried this one yesterday

 doedit "196461-V1\CODE\build_data_run_analysis_complie_PDF.do" 
...
* EDIT THE PATH TO THE REPLICATION FOLDER HERE
* BUT ALSO IN 
* ~/CODE/Python/fuzzy_merge_water.PY
...

* MAKE SURE TO COPY ols_spatial_HAC_W.ado from ~/CODE/Stata/ADO to your local folder
  • So a little bumpy: manual path edits; proprietary software; needs Python and Stata installations, plus TeX via shell pdflatex
  • but it’s there and seems to work

3.2 Discourse article

 doedit "198744-V1\BVWYAnalysisAll.do" 

One click

Not bad really

3.3 What I’d love

library("replicate_anything")

replicate(issn = "0002-828",  what = "fig_1", from_raw = FALSE,  format = "html")

yielding:

# > Replication of fig_1 from Bartling et al 2024 using pre-processed data

3.4 Reproduction out of the box

What I’d love:

library("replicate_anything")

get_code(issn = "0002-828",  what = "fig_1", from_raw = FALSE,  formatted = FALSE)

yielding:

# Code for fig_1 from Bartling et al 2024 using pre-processed data

"
  read.csv("0002-828/processed/fig_1.csv") |> 
  ggplot(..)
  
"  

3.5 Reproduction out of the box

  • And of course: ex ante we would then code and organize our files so as to make them work with replicate_anything

  • And even better: as a web interface to avoid local installations

Can someone build this?
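As a rough sketch of what the backend of such a package could look like (everything here is hypothetical: the function, the `manifest.csv` layout, and the column names are invented for illustration):

```r
# Hypothetical sketch of a replicate_anything backend; nothing here exists yet.
# Assumes each archived package ships a manifest.csv that maps exhibits
# (e.g. "fig_1") to a build script, an analysis script, and a report template.
replicate <- function(issn, what, from_raw = FALSE, format = "html") {
  manifest <- read.csv(file.path(issn, "manifest.csv"))
  entry <- manifest[manifest$exhibit == what, ]
  if (from_raw) source(entry$build_script)   # rebuild processed data from raw
  source(entry$analysis_script)              # regenerate the exhibit
  rmarkdown::render(entry$report_template, output_format = format)
}
```

The hard part is not this wrapper but the standardization it presupposes: a common manifest format across journals and archives.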

4 Registration completeness

Problem: We are all registering now but we don’t know what we need to register

Deeper issue: Fuzziness about what a design is

4.1 Registration completeness

4.2 DeclareDesign Main idea

  • Think of a research design as an interrogable object: design_1
  • Define: Models, Inquiries, Data Strategies, Answer Strategies
[Figure: the MIDA framework. A model M (the worlds you consider, including the real world), an inquiry I (the question you ask, with the estimand as the answer you seek), a data strategy D (the data you'll get), and an answer strategy A (the estimate as the answer you'll get), mapped across Theory vs. Empirics and Simulations (Design) vs. Reality.]

4.3 A design

With DeclareDesign designs can be quite compact and readable:

library(DeclareDesign)

b <- .3  # conjectured effect size

design_1 <- 
  
  declare_model(
    N = 500,
    u_0 = rnorm(N),
    u_1 = rnorm(N),
    potential_outcomes(Y ~ u_0 + Z*(b + u_1))) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) + 
  declare_assignment(Z = complete_ra(N = N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, model = lm_robust)

4.4 A complete design

This is a complete design declaration.

  • it can be “run”: run_design(design_1)
  • and “diagnosed”: diagnose_design(design_1)

| Inquiry | Bias | RMSE | Power | Coverage | Mean Estimate | SD Estimate | Mean SE | Mean Estimand |
| ate | 0.00 | 0.10 | 0.78 | 0.97 | 0.30 | 0.11 | 0.11 | 0.30 |
|  | (0.00) | (0.00) | (0.02) | (0.01) | (0.00) | (0.00) | (0.00) | (0.00) |
  • We see the design has nice features: reasonable power, unbiasedness
  • Coverage is off though (why?)

4.5 Registration

The DeclareDesign idea is that the design object is the thing you register.

  • The design object captures:

    • your background assumptions about the conditions in which you are operating
    • what your question really is
    • how you plan to generate data
    • how you plan to analyze it
  • It’s short, complete, interrogable.

This is being done but it is a lift.

5 Reconciliation

Issues:

  • We are getting better at registration, but deviations are the norm.

  • How to spot and make sense of deviations?

5.1 In practice

  • We start with design_1

  • Then we implement and we do a few things differently:

    • We gather data on fewer cases than we expected (at random)
    • We have some missingness in outcome measures (not at random)
    • We end up focusing on analysis in a subgroup
  • In addition, we learn things from the data that we only speculated about at the design stage.

5.2 In practice

  • Try to represent the actually implemented research as an alternative design: design_2

  • Reconcile formally: compare_designs(design_1, design_2) to:

    • automate reconciliation
    • clarify nature of deviations
    • clarify implications of deviations
    • invite critique
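For instance, design_2 could be declared by mirroring design_1 and editing only the inputs that actually changed (a sketch: the reduced N and the missingness rule below are purely illustrative):

```r
library(DeclareDesign)

b <- .3

# Sketch: design_2 represents the study as actually implemented.
# Deviations from design_1: fewer cases than planned, and outcome
# missingness that depends on an unobservable (missing not at random).
design_2 <-
  declare_model(
    N = 400,                                   # fewer cases than expected
    u_0 = rnorm(N),
    u_1 = rnorm(N),
    potential_outcomes(Y ~ u_0 + Z * (b + u_1))) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N = N)) +
  declare_measurement(
    Y = reveal_outcomes(Y ~ Z),
    Y = ifelse(u_0 > 1.5, NA, Y)) +            # illustrative MNAR rule
  declare_estimator(Y ~ Z, model = lm_robust)

# Then reconcile against the registered design:
# compare_designs(design_1, design_2)
```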

5.3 Comparisons

If you provide design_2, with adjusted inputs, along with design_1 then readers can compare them.

| inquiry | estimator | diagnosand | design1 | design2 | difference |
| ate | estimator | bias | 0.00 (0.00) | 0.00 (0.00) | -0.00 (0.00) |
| ate | estimator | coverage | 0.97 (0.00) | 0.96 (0.00) | -0.01 (0.00) |
| ate | estimator | mean_estimand | 0.30 (0.00) | 0.30 (0.00) | 0.00 (0.00) |
| ate | estimator | mean_estimate | 0.30 (0.00) | 0.30 (0.00) | -0.00 (0.00) |
| ate | estimator | mean_se | 0.11 (0.00) | 0.12 (0.00) | 0.01* (0.00) |
| ate | estimator | power | 0.79 (0.01) | 0.74 (0.01) | -0.05* (0.01) |
| ate | estimator | rmse | 0.10 (0.00) | 0.11 (0.00) | 0.01* (0.00) |
| ate | estimator | sd_estimate | 0.11 (0.00) | 0.12 (0.00) | 0.01* (0.00) |
| ate | estimator | type_s_rate | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) |

5.4 Reconciliation and critique

Your reconciliation can clarify:

  1. how things are different
  2. whether things are different in relevant ways (e.g. under optimistic conditions)
  3. where conclusions in fact depend on the reasons for deviations

Proposal: attach both the original (registered) design and the reconciled design to your manuscript.

6 Design-based Critique

These design modification choices could just as easily be made by a different researcher re-analyzing your work.

A researcher might propose altering the answer strategy, A. Common current practice is to evaluate such a decision by how results change (robustness) and not (simply) by its ex ante properties.

Instead, justify reanalysis decisions with respect to:

  • Home ground dominance. Holding the original model constant (the “home ground” of the original study), show that a new answer strategy yields better diagnosands than the original

  • Robustness to alternative models. Demonstrate that a new answer strategy is robust to both the original model and a new, also plausible, model

7 Replication

There seems to be a lot of confusion about when a (field) replication succeeds or fails in a heterogeneous world

7.1 Replication

Tests of the form:

  • Do both results have the same sign?
  • Is the original effect inside the CI of the new result (or vice versa)? Or even:
  • Can we reject the null that the two effects are the same?

make little sense given substantial effect heterogeneity
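A toy illustration of the problem (all numbers invented): with genuine effect heterogeneity, two sites can each estimate their own site-specific effect almost perfectly and still flunk a same-sign test.

```r
set.seed(42)
# True site-level effects drawn from a heterogeneous population
tau_site <- rnorm(2, mean = 0.2, sd = 0.3)
# Each site estimates its own effect very precisely
est <- rnorm(2, mean = tau_site, sd = 0.05)
# Both estimates are accurate, yet the "same sign?" test can still fail
same_sign <- sign(est[1]) == sign(est[2])
```

When this test “fails” it reflects real cross-site heterogeneity, not error in either study.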

7.2 But

  • Refusing to test shifts to non-falsifiability

    • Do Posner et al.’s results threaten Björkman and Svensson or not?
  • Obvious: the exercise presupposes we are interested in population estimands, not sample estimands; let’s own that

  • Less obvious: articles (findings) should be accompanied by explicit “generalization claims”: to what populations and conditions, and how far, do results plausibly extend?

  • Takeaway: these generalization claims are what need to be stated and tested

8 Cumulation

We have gotten better at coordinated trials:

8.1 Cumulation

Though they are almost never on random samples

A rolling trial would have:

  • admission criteria
  • priors on effects
  • sampling information
  • heterogeneity data

8.2 Cumulation

A rolling trial would:

  • provide a basis to determine ex ante if a trial should be implemented
  • imply a workflow designed to aggregate
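One could imagine, as a hypothetical sketch, an ex ante admission check that weighs a candidate site’s expected reduction in uncertainty (via standard normal-normal precision weighting) against its cost; the function, inputs, and decision rule below are all invented:

```r
# Hypothetical sketch: should a new site be admitted to a rolling trial?
# prior_sd: current uncertainty about the (population) effect
# expected_se: anticipated standard error of the candidate site's estimate
# cost: cost of running the site, on the same scale as the valued sd reduction
admit_site <- function(prior_sd, expected_se, cost, value_per_sd_reduction = 1) {
  # Normal-normal updating: precisions (1/variance) add
  posterior_sd <- sqrt(1 / (1 / prior_sd^2 + 1 / expected_se^2))
  gain <- value_per_sd_reduction * (prior_sd - posterior_sd)
  gain > cost
}

admit_site(prior_sd = 0.3, expected_se = 0.1, cost = 0.1)   # informative site
admit_site(prior_sd = 0.05, expected_se = 0.2, cost = 0.1)  # little to add
```

The point is not this particular rule but that a rolling trial's admission criteria could be stated as code and applied before, not after, a site runs.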

Will someone set these up?

9 Reporting

Temptations to produce splashy headlines appear irresistible

9.1 Reporting

Need some way to keep focus on cumulative findings: though seems very contrary to instincts

Dream: A combination of this

9.2 Reporting


Dream: And this

9.3 Reporting

So we can collectively say: this is what the evidence says, about effects under these and those conditions

All at what_we_know.com

10 So…

  • We are moving rapidly in the right direction
  • Still money on the table
  • Can some of these be done in a way that also makes researchers’ lives easier rather than harder?